A robust CUSUM control chart for median absolute deviation based on trimming and winsorization

Statistical quality control is concerned with the analysis of production and manufacturing processes. Control charts are process control techniques commonly applied to observe and control deviations. Shewhart control charts are sensitive to large shifts and rest on the basic assumption of normality. Cumulative sum (CUSUM) control charts are effective for identifying changes that may have special causes, such as outliers or excessive variability in subgroup means. This study uses the CUSUM control chart structure to evaluate the performance of robust dispersion estimators. We investigate the design structure of various control charts based on existing estimators and on new robust scale estimators that use trimming and winsorization in different scenarios. Median absolute deviation estimators based on trimming and winsorization are introduced. The effectiveness of CUSUM control charts based on these estimators is evaluated in terms of the average run length (ARL) and the standard deviation of the run length (SDRL) using a simulation study. The results show the robustness of the CUSUM chart in detecting small changes in magnitude for both normal and contaminated data. In general, CUSUM charts based on the robust estimators MADTM and MADWM outperform the others in all environments.


Introduction
Statistical process control (SPC) is a method used in quality control to apply statistical techniques for monitoring and managing a system. SPC begins during the planning phase of a product or service, when the relevant attributes are specified. In 1931, Shewhart introduced the concept of control charts, a pivotal technique in SPC. However, the effectiveness of these control charts diminishes when the assumption of normality is violated and outliers are present in the data.
For enhanced robustness, it is desirable to have control charts that are less influenced by violations of fundamental assumptions. The selection of a control chart depends on the process attribute under consideration and on the type and size of the change or shift to be evaluated. Control charts are broadly classified into two categories: memoryless control charts and memory control charts.
Memoryless control charts, often referred to as Shewhart-type control charts, are less sensitive to small and moderate parameter changes in location and dispersion. On the other hand, memory control charts, such as CUSUM control charts [1-3] and exponentially weighted moving average (EWMA) control charts [4-7], are designed to address issues related to outliers and deviations from normality.
CUSUM charts have gained popularity in quality control due to their simplicity and efficiency, and were initially used for monitoring the mean level of a process [8,9]. However, their application to monitoring process variability has received less attention. Hawkins suggested a robust chart for individual observations based on winsorization, while Lucas and Crosier explored methods to enhance the robustness of standard CUSUM charts [10-12].
Lee et al. [13] proposed CUSUM charts for systematically correlated data, Wang et al. introduced a nonparametric CUSUM chart based on the Mann-Whitney statistic, and Wang et al. [14,15] suggested an adaptive multivariate CUSUM chart. Moustafa [16] introduced modified Shewhart charts that use the median and the median absolute deviation as robust location and dispersion estimators.
Ou et al. [17,18] conducted a comparison study on the performance of various control charts, including standard X charts, CUSUM charts, and sequential probability ratio test (SPRT) control charts, considering special situations such as trimmed and winsorized means. Wang et al. [19] introduced trimmed and winsorized means for transformed data based on scaled deviation, which proved to be more robust.
The Maxwell CUSUM control chart, proposed by Hossain et al. [20], efficiently monitors failure rates in boring processes. The VCUSUM chart, based on a Maxwell distribution, has been developed to detect small changes in a process. Castagliola et al. [21] used the CUSUM median chart, and Moustafa et al. [22] suggested MTSD-TCC, a robust control chart based on the modified trimmed standard deviation (MTSD), as an alternative to Tukey's control chart (TCC).
This paper aims to enhance the efficiency of CUSUM control charts by modifying the dispersion estimators they use and by comparing the efficiency of robust estimators in different environments. The investigation covers the performance of CUSUM control charts in uncontaminated environments, in contaminated environments with symmetric and asymmetric variance disturbances, and in non-normal environments, using the average run length (ARL) and the standard deviation of the run length (SDRL).
To facilitate interpretation, the discussion focuses on the upper side of the CUSUM control charts; two-sided CUSUM control charts behave in a qualitatively similar way. The remaining sections of the paper are organized as follows: Section 2 describes dispersion estimators, Section 3 presents the proposed estimators, and Section 4 outlines the proposed CUSUM control chart (Fig 1) with different robust dispersion estimators based on trimming and winsorization. Finally, the major conclusions are summarized in the closing section.

Description of process dispersion estimators
Let ϑ be the process dispersion parameter to be monitored by control charts and Ŵ be its estimator based on a sample of size n. There are several choices for Ŵ. David [23] gives a clear description of standard deviation estimators. Typical estimators are the average of the sample standard deviations, the pooled sample standard deviation, and the average of the sample ranges. Mahmoud et al. [24] investigated the relative ability of estimators for k samples of size n. Schoonhoven et al. [25] considered various estimators of the population standard deviation and presented a detailed overview of their efficiency and use for different stages in the control chart.
The following estimators are used in this paper. The first estimator of ϑ is the sample standard deviation S, defined as:

$$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2}$$

where $Y_i$ denotes the $i$th observation in a sample of size n and $\bar{Y}$ denotes the sample mean. In a normally distributed environment, the sample standard deviation S is the most efficient estimator, but it is strongly influenced by outliers. The breakdown point of the sample standard deviation (the proportion of outlying observations that an estimator can tolerate) is zero.
The next estimator used in the CUSUM-Ŵ charts is the sample interquartile range (IQR), defined by:

$$IQR = Q_3 - Q_1$$

where $Q_1$ and $Q_3$ are the first and the third quartiles of the sample, respectively. The sample interquartile range is more stable than the sample standard deviation [26]. The breakdown point of the IQR is 25%.
The median absolute deviation from the sample median (MADM) is a more robust dispersion estimator than the sample standard deviation. It is based on the absolute differences of the data from the median of the sample. The MADM is defined as:

$$MADM = 1.4826 \times \operatorname{median}_{i}\left|Y_i - \tilde{Y}\right|$$

where $\tilde{Y}$ is the sample median. The constant 1.4826 is required to make the estimator consistent for the parameter of interest: when a random sample is taken from a normal distribution, setting the constant to 1.4826 makes the MADM an estimator of σ. In other words, the median absolute deviation is 1.4826 times the median of the absolute differences of the individual values of a dataset from the median of the dataset (Supporting Data).
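As a rough illustration, the three estimators of this section can be sketched in Python using only the standard library. This is not the authors' R code. The function names are hypothetical, the IQR is taken as the plain difference of the third and first quartiles, and the quartile interpolation rule (`method="inclusive"`) is an assumption, since the text does not specify one.

```python
import statistics


def sample_sd(y):
    """Sample standard deviation S (divisor n - 1)."""
    return statistics.stdev(y)


def sample_iqr(y):
    """Third quartile minus first quartile.

    The quartile rule is an assumption; other rules give slightly
    different values for small samples.
    """
    q1, _, q3 = statistics.quantiles(y, n=4, method="inclusive")
    return q3 - q1


def madm(y, c=1.4826):
    """c times the median of absolute deviations from the sample median."""
    med = statistics.median(y)
    return c * statistics.median(abs(v - med) for v in y)
```

For the sample 2, 4, 4, 5, 100 the single outlier inflates S heavily, while the MADM stays at 1.4826, which is the robustness property the section describes.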

Proposed estimators based on trimmed winsorization
The trimmed mean is a relatively robust estimate of the centre, which reduces the effect of outliers or heavy tails by eliminating the observations at the tails of the distribution.
The breakdown point is determined by the amount of trimming, thus BDP = α. A basic rule of thumb is to remove 10% of the observations from each tail of the distribution (i.e., set α = 0.2). The first proposed estimator is the mean deviation from the trimmed mean (MDTM). The next proposed estimator is the median of the absolute deviations from the trimmed mean (MADTM).
The method of replacing a given number of extreme values with less extreme values is known as winsorizing or winsorization. Let $Y_1, Y_2, \ldots, Y_n$ represent observations on a variable from a random sample of size n. The values are sorted from smallest to largest, i.e. $Y_{(1)} \le Y_{(2)} \le \cdots \le Y_{(n)}$, and the smallest k values are replaced with the (k+1)st smallest value. The same process applies to the largest values: the largest k values are replaced with the (k+1)st largest value. The mean of this new set of numbers is known as the winsorized mean. The winsorized mean is a robust, unbiased estimate of the population mean if the data are from a symmetric population. The k-times winsorized mean $\bar{Y}_W$ is defined as:

$$\bar{Y}_W = \frac{1}{n}\left[(k+1)\,Y_{(k+1)} + \sum_{i=k+2}^{n-k-1} Y_{(i)} + (k+1)\,Y_{(n-k)}\right]$$

The remaining proposed estimators are the mean deviation from the winsorized mean (MDWM) and the median of the absolute deviations from the winsorized mean (MADWM).
For comparison and to determine the precision of the robust dispersion estimators used in this analysis, the standardized variances of the estimators as proposed by Rousseeuw and Croux [27] and the relative efficiencies of the estimators as suggested by Abbasi and Miller [28] are calculated.
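The verbal definitions above can be turned into a short Python sketch. It is only one plausible reading, not the authors' implementation: the function names are hypothetical, the trimming fraction is applied per tail and rounded down, no consistency constant is applied to any of the four estimators, and the deviations are taken over all n observations. Each of these choices is an assumption that the text does not settle.

```python
import math
import statistics


def trimmed_mean(y, per_tail):
    """Mean after dropping floor(per_tail * n) order statistics from each tail."""
    ys = sorted(y)
    g = math.floor(per_tail * len(ys))
    if 2 * g >= len(ys):
        raise ValueError("trimming removes every observation")
    return statistics.fmean(ys[g:len(ys) - g])


def winsorized_mean(y, k):
    """Mean after replacing the k smallest values with the (k+1)st smallest
    and the k largest values with the (k+1)st largest."""
    ys = sorted(y)
    n = len(ys)
    if 2 * k >= n:
        raise ValueError("k is too large for this sample")
    return statistics.fmean([ys[k]] * k + ys[k:n - k] + [ys[n - k - 1]] * k)


def _mean_abs_dev(y, center):
    return statistics.fmean(abs(v - center) for v in y)


def _median_abs_dev(y, center):
    return statistics.median(abs(v - center) for v in y)


def mdtm(y, per_tail):
    """Mean deviation from the trimmed mean (unscaled, assumed form)."""
    return _mean_abs_dev(y, trimmed_mean(y, per_tail))


def madtm(y, per_tail):
    """Median absolute deviation from the trimmed mean (unscaled, assumed form)."""
    return _median_abs_dev(y, trimmed_mean(y, per_tail))


def mdwm(y, k):
    """Mean deviation from the winsorized mean (unscaled, assumed form)."""
    return _mean_abs_dev(y, winsorized_mean(y, k))


def madwm(y, k):
    """Median absolute deviation from the winsorized mean (unscaled, assumed form)."""
    return _median_abs_dev(y, winsorized_mean(y, k))
```

Note that with n = 5 a per-tail fraction of 10% rounds down to zero trimmed observations under this convention, so the rounding rule matters for the small sample sizes used in the paper.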
The standardized variance of the dispersion estimator Ŵ ($SV_{\hat{W}}$) is computed first. The denominator of $SV_{\hat{W}}$ is necessary to obtain a natural measure of the precision of a scale estimator [29]. The relative efficiency of the estimator ($RE_{\hat{W}}$) is then calculated. First, the $SV_{\hat{W}}$ and $RE_{\hat{W}}$ values for all robust estimators are computed and compared.
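The following sketch shows how such a comparison could be run by simulation. Two assumptions are made and should be checked against [27] and [28]: the standardized variance is taken in the form n·Var(Ŵ)/[E(Ŵ)]², and the relative efficiency is taken as the ratio of the standardized variance of a baseline estimator to that of the estimator under study. The function names and the defaults (`reps`, `seed`) are hypothetical.

```python
import random
import statistics


def standardized_variance(estimator, n, reps=5000, seed=1):
    """Monte Carlo estimate of n * Var(W) / E(W)^2 under N(0, 1) samples.

    The formula is an assumed form of the standardized variance.
    """
    rng = random.Random(seed)
    vals = [
        estimator([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(reps)
    ]
    return n * statistics.variance(vals) / statistics.fmean(vals) ** 2


def relative_efficiency(estimator, baseline, n, reps=5000, seed=1):
    """Assumed form: SV of the baseline divided by SV of the estimator."""
    sv_base = standardized_variance(baseline, n, reps, seed)
    sv_est = standardized_variance(estimator, n, reps, seed)
    return sv_base / sv_est
```

Because both calls reuse the same seed, the two estimators are evaluated on identical simulated samples, which removes part of the simulation noise from the ratio.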

The proposed method of CUSUM charts for different robust dispersion estimators
For the CUSUM procedure, consider the detection of an increase in the process dispersion parameter ϑ. Let Ŵ be an estimator of ϑ from Section 2, computed from a random sample of size n taken at regular intervals from a continuous production process. According to Tuprah and Ncube [30], the CUSUM-Ŵ chart is defined as:

$$Y_t = \max\left[0,\; Y_{t-1} + \left(\hat{W}_t - K_{\hat{W}}\right)\right]$$

where $Y_0 = 0$ and $K_{\hat{W}}$ is the reference value of the scheme. $Y_t$ is plotted against the sample number t. The process is declared out of control if $Y_t > H_{\hat{W}}$ (where $H_{\hat{W}}$ defines the decision interval) for any value of t, and it is concluded that the dispersion of the process has increased. The run length is the random variable giving the sample number at which $Y_t > H_{\hat{W}}$ first occurs, and the average run length (ARL) is its expected value. The $H_{\hat{W}}$ values are selected such that changes in the process dispersion are quickly identified. In all the scenarios considered in this analysis, the $H_{\hat{W}}$ values are selected, together with the $K_{\hat{W}}$ value, to give a fixed in-control ARL, denoted by $ARL_0$. $ARL_1$ denotes the out-of-control ARL, which should be as small as possible. The reference value $K_{\hat{W}}$ follows Tuprah and Ncube [30], Ewan and Kemp [31], and E.S. Page [32]: it is taken as half of the sum of the expected value of Ŵ given $\vartheta_0 = 1$ and the expected value of Ŵ given $\vartheta_1 = 1.4$, where $\vartheta_0$ is the target value and $\vartheta_1$ is the value of the process dispersion that needs to be detected quickly. E.S. Page [32], in Table 1, presented the reference values for detecting a change (that is, $\vartheta_1 = 1.40$ to $\vartheta_1 = 2.23$) in the dispersion of the process using the sample range.
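A minimal sketch of the upper CUSUM statistic may help here. It assumes the standard one-sided recursion, in which each new statistic is the larger of zero and the previous statistic plus the estimate minus the reference value, starting from zero. The function name and return format are hypothetical.

```python
def upper_cusum(w_values, k, h):
    """Upper one-sided CUSUM on a sequence of dispersion estimates.

    Returns the path Y_1, Y_2, ... and the first sample number t with
    Y_t > h, or None if the chart never signals.
    """
    y = 0.0  # Y_0 = 0
    path = []
    signal = None
    for t, w in enumerate(w_values, start=1):
        y = max(0.0, y + w - k)  # accumulate only the excess over k
        path.append(y)
        if signal is None and y > h:
            signal = t
    return path, signal
```

For example, with estimates 1.0, 1.5, 2.0, 0.5, reference value 1.2 and decision interval 1.0, the statistic takes the values 0, 0.3, 1.1, 0.4 (up to rounding) and the chart first signals at the third sample.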
Accordingly, it is difficult to find the value of $E(\hat{W} \mid \vartheta)$ needed for $K_{\hat{W}}$ analytically. Simulation is used for this purpose: random samples are generated from the normal distribution with $\vartheta_0 = 1$ and $\vartheta_1 = 1.40$, respectively, and the expected value is estimated from them.
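The simulation of the reference value can be sketched as follows. The sketch assumes that ϑ is the standard deviation of a zero-mean normal distribution and that the reference value is the midpoint of the two simulated expectations. The function names and the defaults (`reps`, `seed`) are hypothetical.

```python
import random
import statistics


def expected_value(estimator, sigma, n, reps=5000, seed=1):
    """Monte Carlo estimate of E(W) for samples of size n from N(0, sigma^2)."""
    rng = random.Random(seed)
    return statistics.fmean(
        estimator([rng.gauss(0.0, sigma) for _ in range(n)])
        for _ in range(reps)
    )


def reference_value(estimator, n, sigma0=1.0, sigma1=1.4, reps=5000, seed=1):
    """Midpoint of the in-control and out-of-control expectations (assumed form)."""
    e0 = expected_value(estimator, sigma0, n, reps, seed)
    e1 = expected_value(estimator, sigma1, n, reps, seed)
    return (e0 + e1) / 2.0
```

For the sample standard deviation with n = 5, the expectation under unit standard deviation is about 0.94, so the simulated reference value should land near 1.13.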
The results of the CUSUM-Ŵ charts are obtained in the following scenarios, based on Tatum [33] and Schoonhoven et al. [25].
1. A model in which all observations are from N(0,1) (i.e., the uncontaminated scenario).
2. A symmetric variance disturbances model, in which each observation has a 99% probability of being drawn from the N(0,1) distribution and a 1% probability of being drawn from N(0,9).
3. A model of asymmetric variance disturbances, in which each observation is drawn from N(0,1) and has a 1% probability of having a multiple of a $\chi^2_1$ variable added to it, with a multiplier equal to 4.
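The three scenarios can be simulated with a small sampler such as the one below. It is a sketch under stated assumptions: N(0,9) is read as variance 9 (standard deviation 3), the chi-square variable with one degree of freedom is generated as the square of a standard normal, and the function name and scenario labels are hypothetical.

```python
import random


def draw_observation(rng, scenario):
    """Draw one observation from the named scenario."""
    z = rng.gauss(0.0, 1.0)
    if scenario == "uncontaminated":
        return z
    if scenario == "symmetric":
        # 1% of observations come from N(0, 9), read as standard deviation 3.
        return rng.gauss(0.0, 3.0) if rng.random() < 0.01 else z
    if scenario == "asymmetric":
        # With probability 1%, add 4 times a chi-square(1) variable.
        if rng.random() < 0.01:
            z += 4.0 * rng.gauss(0.0, 1.0) ** 2
        return z
    raise ValueError("unknown scenario: " + str(scenario))
```

A sample of size n is then a list of n such draws, for example `[draw_observation(rng, "symmetric") for _ in range(5)]` with `rng = random.Random(1)`.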
In the different scenarios (normal and non-normal), the $H_{\hat{W}}$ values are searched by drawing random samples separately from the environments described until the value of $H_{\hat{W}}$ that yields the desired ARL is obtained in each case. An iterative method is used to reach the desired ARL for the chosen reference value $K_{\hat{W}}$. Table 4 gives the values of $H_{\hat{W}}$ for $ARL_0 = 500$. Alternative values of $H_{\hat{W}}$ can be found in the same way for other values of $ARL_0$. Since the $ARL_0$ of the CUSUM-Ŵ chart is sensitive to these values, the $K_{\hat{W}}$ and $H_{\hat{W}}$ values must be carefully selected.
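The text does not say which iterative method is used. One simple possibility, shown here purely as an assumption, is bisection on the decision interval, using the fact that the in-control ARL grows as the decision interval grows. The function `find_h` and its arguments are hypothetical, and `arl_fn` stands for any routine that returns an estimated in-control ARL for a given decision interval.

```python
import math


def find_h(arl_fn, target_arl, lo, hi, tol=1e-3, max_iter=60):
    """Bisection search for the decision interval h with arl_fn(h) close to target_arl.

    Assumes arl_fn is non-decreasing in h and that [lo, hi] brackets the target.
    """
    if not (arl_fn(lo) <= target_arl <= arl_fn(hi)):
        raise ValueError("target ARL is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if arl_fn(mid) < target_arl:
            lo = mid  # chart signals too early: widen the interval
        else:
            hi = mid  # chart signals too late: tighten the interval
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Toy check with a deterministic stand-in for the ARL curve (not a real chart).
toy_h = find_h(math.exp, 500.0, 0.0, 10.0, tol=1e-6)
```

With a simulated ARL in place of the toy curve, reusing the same random seed for every evaluation keeps the estimated curve close to monotone, which bisection relies on.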

Evaluation of CUSUM-Ŵ charts performance
The ARL is used to evaluate the performance of the suggested CUSUM-Ŵ charts. The ARL of in-control and out-of-control processes is calculated using Monte Carlo simulation. The simulation proceeds as follows: 20000 random samples of size n are generated from the different scenarios (i.e., normal, contaminated normal, or non-normal), and the dispersion estimators are computed, namely some existing estimators (i.e., S, IQR, and MADM) as well as the suggested robust estimators (i.e., MDTM, MDWM, MADTM, and MADWM) based on trimming and winsorization at 10%, 20%, and 25%. Tables 3 and 4 are used to generate the corresponding limits of the control chart. The sample number at which the statistic $Y_t$ falls beyond the control limit is known as the run length, and it is a random variable. To determine the run length distribution, the same process is repeated 12000 times. The ARL is the average of the run length distribution and the SDRL is its standard deviation. The run lengths are computed with code written in the R language.
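The run length simulation described above can be sketched in Python as follows. This is an illustration rather than the authors' R code: the function names are hypothetical, samples are drawn from a zero-mean normal distribution with standard deviation `sigma`, the chart is the upper one-sided CUSUM started at zero, and runs are cut off at `max_t` samples, which is a practical assumption.

```python
import random
import statistics


def run_length(estimator, k, h, n, sigma, rng, max_t=100000):
    """Sample number at which the upper CUSUM first exceeds h.

    Runs that never signal are censored at max_t.
    """
    y = 0.0
    for t in range(1, max_t + 1):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        y = max(0.0, y + estimator(sample) - k)
        if y > h:
            return t
    return max_t


def arl_sdrl(estimator, k, h, n, sigma=1.0, reps=12000, seed=1):
    """Mean and standard deviation of the simulated run length distribution."""
    rng = random.Random(seed)
    lengths = [run_length(estimator, k, h, n, sigma, rng) for _ in range(reps)]
    return statistics.fmean(lengths), statistics.stdev(lengths)
```

Calling `arl_sdrl` with `sigma=1.0` gives in-control values, and calling it with a larger `sigma` gives out-of-control values for that shift.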

Results and discussions
The $ARL_1$ and $SDRL_1$ are used in different environments to evaluate the performance and efficiency of the CUSUM-Ŵ charts. Shifts are expressed in terms of ϑ, so that the shifted dispersion parameter is δϑ. Here δ = 1 indicates that there is no shift in ϑ and the dispersion of the process is constant, and δ > 1 indicates that the process ϑ has increased. $ARL_1$ increases as the process shift decreases, and the SDRL decreases as the size of the process shift increases. When the process is in control, the ARL and SDRL are expected to be close to the target value of 500. In all environments, CUSUM charts based on the robust MADTM and MADWM estimators work well.

Uncontaminated environment.
In an uncontaminated environment, all observations are normally distributed, N(0,1). This environment is the fundamental assumption of the design structure of each chart. It provides a conceptual framework for comparing the various types of control charts and the suggested CUSUM-Ŵ chart. Table 5 shows the ARL results.
A large value of ARL is desired when the process is stable or in control. In Table 5, bold type marks the highest ARL of the robust estimators at different levels of trimming and winsorization with sample sizes of n = 5 and n = 9. The highlighted values in Table 5 show that, for sample size n = 5, the CUSUM-Ŵ chart based on the standard deviation S performs best compared with IQR and MADM. The proposed estimator MDTM (at 10%, 20%, and 25% trimming) performs best for both sample sizes (n = 5 and 9) compared with S, IQR, and MADM. For both sample sizes n = 5 and n = 9, when the shift δ > 1.25, the MADTM (at 10%, 20%, and 25% trimming) and the MADWM (at 10%, 20%, and 25% winsorizing) perform better than S, IQR, and MADM. The ARLs of the proposed estimators MADTM (at 10%, 20%, and 25% trimming) and MADWM (at 10%, 20%, and 25% winsorizing) are larger than those of all other estimators for both sample sizes (n = 5 and 9). This shows that the performance of both proposed estimators is best.
To further characterize the distribution of run lengths in the uncontaminated case, the SDRL of the CUSUM-Ŵ charts is also recorded to measure run length performance, as proposed by Antzoulakos and Rakitzis [34]. Table 6 shows the details. The SDRL is expected to be close to its target value of 500 when the process is in control. Table 6 shows that the SDRL is significantly lower than the target value for certain CUSUM-Ŵ charts and that the SDRL decreases for all charts as δ increases.

Symmetric variance environment.
A symmetric variance disturbance model is used when the spread parameter has been disturbed. In such an environment, we examine the performance of the suggested estimators with their corresponding CUSUM charts when each observation has a 99% probability of being drawn from the normal distribution N(0,1) and a 1% probability of being drawn from the normal distribution N(0,9). Tables 7 and 8 present the ARL and SDRL results of the symmetric variance environment for sample sizes n = 5 and n = 9.
Tables 7 and 8 show that the MADTM (at 10%, 20%, and 25% trimming) has the best overall performance among the estimators for all shifts of the process dispersion, for both sample sizes n = 5 and n = 9. The MADWM (at 10%, 20%, and 25% winsorizing) is very sensitive when the sample size is small (n = 5), but as the sample size increases (n = 9) the MADWM (at 10%, 20%, and 25% winsorizing) performs well compared with S, IQR, and MADM. For shifts δ > 1.25, the IQR, the MADTM (at 20% and 25% trimming), and the MADWM (at 10% winsorizing) are good for the small sample size n = 5; for the larger sample size n = 9, the MADTM (at 10%, 20%, and 25% trimming) and MADWM (at 10%, 20%, and 25% winsorizing) perform best compared with the other estimators as the shift in the process dispersion increases.

Asymmetric variance environment.
In an asymmetric variance environment, each observation is drawn from the normal distribution N(0,1) and has a 1% probability of having a multiple of a $\chi^2_1$ variable (chi-square with one degree of freedom) added to it, with a multiplier equal to 4. Tables 9 and 10 show the ARL and SDRL results, respectively, for sample sizes n = 5 and n = 9. Table 9 illustrates that, for the small sample size n = 5, S and MADM are better than MADWM (at 20% and 25% winsorizing) but less efficient than the other estimators. When the sample size is small (n = 5), the CUSUM-Ŵ chart based on IQR performs well compared with S, MADM, MDTM (at 10% trimming), and MADWM (at 20% and 25% winsorizing). The larger values of ARL are highlighted. For the small sample size n = 5, MDTM (at 20% and 25% trimming) is better than S, IQR, and MADM. For the large sample size n = 9, it is better than IQR and MADM. The performance of MDTM (at 10%, 20%, and 25% trimming), MDWM (at 10%, 20%, and 25% winsorizing), and MADWM (at 10%, 20%, and 25% winsorizing) is best for the large sample size n = 9 and more efficient than that of S, IQR, and MADM. The MADTM (at 10%, 20%, and 25% trimming) shows superior performance to the other estimators for all increasing shifts of the process dispersion for both sample sizes n = 5 and n = 9. When δ > 1.25, IQR, MADM, and MADTM (at 20% and 25% trimming) outperform all other estimators for both sample sizes n = 5 and n = 9.

Non-normal environment.
The samples from the non-normal environments are transformed, without loss of generality, so that the resulting sample has zero mean and unit variance. To this end, the mean is subtracted from each sample taken from the non-normal environment and the result is divided by the standard deviation of the non-normal environment, so that the results are correct and the performance is comparable.
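The standardization step can be sketched as below for the two distributions named in this section. The sketch assumes that the theoretical mean and standard deviation of the distribution are used, that G(2,1) means shape 2 and scale 1, and that the logistic distribution is the standard one with location 0 and scale 1, since the text gives no logistic parameters. The function names are hypothetical.

```python
import math
import random


def standardized_gamma(rng, shape=2.0, scale=1.0):
    """Gamma draw shifted and scaled to zero mean and unit variance."""
    mean = shape * scale
    sd = math.sqrt(shape) * scale
    return (rng.gammavariate(shape, scale) - mean) / sd


def standardized_logistic(rng):
    """Standard logistic draw (inverse CDF) scaled to unit variance."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() can return exactly 0.0
        u = rng.random()
    x = math.log(u / (1.0 - u))
    # The standard logistic distribution has mean 0 and variance pi^2 / 3.
    return x / (math.pi / math.sqrt(3.0))
```

After this transformation the in-control dispersion is 1 in every environment, so the same shift sizes δ can be compared across the normal, Gamma and Logistic cases.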
Tables 11 and 12 present the ARL values of the different estimators for detecting an increase in the process dispersion of different magnitudes, for in-control $ARL_0 = 500$ and sample size n = 5, when the underlying process distributions are Gamma and Logistic. The following are some important outcomes of the ARL and SDRL values for the Gamma distribution G(2,1).

Conclusion
In this paper, several estimators of dispersion parameters are considered for use in the development of Phase II control limits. These include some widely used estimators as well as robust estimators that are uncommon in the control chart literature. The process dispersion was monitored using the CUSUM-Ŵ control chart structure for these estimators. The results of these robust estimators are evaluated in different environments: the uncontaminated environment, contaminated environments with symmetric and asymmetric variance disturbances, and non-normal environments.

Table 10. SDRL values of robust estimators based on CUSUM-Ŵ charts under the asymmetric variance contaminated environment when $ARL_0 = 500$.

All charts perform well under the uncontaminated environment, but the CUSUM-Ŵ control charts based on the MADTM (at 20% and 25% trimming) and MADWM (at 10%, 20%, and 25% winsorizing) outperform all other estimators under normality for the large sample size n = 9. The performance of the suggested estimators MDTM (at 10%, 20%, and 25% trimming) and MADTM (at 10%, 20%, and 25% trimming) is good for both sample sizes n = 5 and n = 9 in the symmetric variance and asymmetric variance environments. When the environment is non-normal the estimators MDTM (at 25%